## University of Western Ontario, Computer Science Department CS3350B, Computer Organization

Assignment 4 Due: April 7, 2023

General Instructions: This assignment consists of 5 pages, 4 exercises and is marked out of 100. For any question involving calculations you must provide your workings. You may collaborate with other students in the class in the sense of general strategies to solve the problems. But each assignment and the answers within are to be solely individual work and completed independently. Any plagiarism found will be taken seriously and may result in a mark of 0 on this assignment, removal from the course, or more serious consequences.

**Submission Instructions:** The answers to this assignment are to be submitted on OWL as a **single PDF file**. Ideally, the answers are to be typed. At the very least, clearly *scanned* copies (no photographs) of hand-written work. If the person correcting your assignment is unable to easily read or interpret your answer then it may be marked as incorrect without the possibility of remarking.

## **Useful Facts:**



Figure 1: MIPS Datapath with Interstage Registers

Exercise 1. [20 Marks] Consider the multi-cycle MIPS datapath presented in Figure 1, it shows 4 inter-stage registers: IF/ID, ID/EX, EX/MEM, and MEM/WB. Consider also the control signals presented in the diagram in blue. Assume the ALUOp control signal is 3 bits. Ignore control signals not shown (i.e. the ones controlling forwarding). Determine the minimum size, in bits, of each of the four inter-stage registers. For part marks, ensure to include workings and not just the total values.

**Exercise 2.** [15 Marks] Consider a pipelined processor with N stages. Each stage except the last runs in t units of time (say pico-seconds) meanwhile the last stage runs in  $m \times t$  units of time where m > 1. Thus, the last stage is slower than the other ones. Assume there are c independent and hazard-free instructions being processed by this pipeline.

- (a) Compute an expression for the actual pipeline speedup based on the variables N, t, m, and c. Show your workings.
- (b) Compute an expression for the ratio of time for which the pipeline runs at *full occupancy* based on the variables N, t, m, and c. Show your workings. Recall full occupancy means every pipeline stage is actively computing work.

## Exercise 3.

Consider the following C code:

```
for (i = 0; i < n; ++i)
a[i] = b[i] + i:
```

where a and b are 32-bit integer arrays of size n.

We have a corresponding MIPS instruction sequence:

```
\# $t0 = 0, which corresponds to i in C code
        add $t0, $0, $0
       lw $s1, 0($s4)
                              # assume $s4 stores the base address of array b
loop:
        add $s0, $s1, $t0
                              \# \$s0 \text{ gets b[i]} + i
        sw $s0, 0($s2)
                              # assume $s2 stores the base address of array a
        addi $t0, $t0, 1
                              \# ++i
                              # get address of a[i+1]
        addi $s2, $s2, 4
        addi $s4, $s4, 4
                              # get address of b[i+1]
        slt $t2, $t0, $s5
                              # assume that $s5 holds n
        bne $t2, $0, loop
                             \# if \$t2 == 1, go to loop
```

Assume that the above MIPS instructions will be executed on a 5-stage pipelined processor. Ignore structural hazards and assume the branch condition is computed in the EX stage.

- (a) [10 Marks] Draw the pipeline execution diagram for one iteration of the loop, that is, starting at the label loop. Assume there is no data forwarding. Compute the CPI (clock cycles per instruction) for that one iteration. Do not re-order the instructions.
- (b) [10 Marks] Draw the pipeline execution diagram for one iteration of the loop, that is, starting at the label loop. This time, assume all possible data forwarding is used. For each instance of data forwarding, annotate your pipeline diagram (say, with arrows, colors, or footnotes) to indicate the source and destination of the forwarding; also say what kind of forwarding it is. Compute the CPI (clock cycles per instruction) for that one iteration. Do not re-order the instructions.
- (c) [5 Marks] Apply loop unrolling (as well as instruction re-ordering, if you like) on the above MIPS code for two iterations. Write out the corresponding MIPS instruction code. Make sure to use offsets appropriately to avoid unnecessary instructions.

(d) [15 Marks] Consider a 2-issue extension of MIPS. That is, a VLIW extension of MIPS where two instructions occur in each instruction word. The issue packet has the following format: the first instruction must be arithmetic or branch, and the second instruction must be a data transfer instruction (1w or sw). Still assume all types of forwarding.

Using the code of your unrolled loop of part (c), statically schedule one iteration of the (now unrolled) loop to run optimally on this 2-issue machine. You may wish to modify your code from part (c) to have a different order and use additional registers to accomplish this task. Consider using the following table as a starting point and to help guide you through the process

|       | ALU or branch | Data transfer | CC |
|-------|---------------|---------------|----|
| loop: |               |               | 1  |
|       |               |               | 2  |
|       |               |               | 3  |
|       |               |               | 4  |
|       |               |               | 5  |
|       |               |               | 6  |
|       |               |               | 7  |
|       |               |               | 8  |
|       |               |               | 9  |
|       |               |               | :  |

Exercise 4. [25 Marks] Consider a multi-core processor with 2 cores, named P1 and P2. Each core has a dedicated cache with the following characteristics:

- 2-way set associative and a 16-byte capacity;
- is initially empty;
- follows the MESI snooping protocol;
- follows write-back and write-allocate protocols; and
- follows a pseudo-LRU replacement policy where
  - (i) empty cache lines in a set are filled first, then,
  - (i) if there are any invalid cache lines in a set replace them, then,
  - (ii) if no invalid cache lines are present, follows a typical LRU replacement policy.

Given the following list of serialized memory byte address accesses by the cores, determine:

- (a) whether each access results in a cache hit, cold miss, conflict miss, capacity miss, true share miss, or false share miss;
- (b) the data stored in each cache after all addresses in the list have been accessed; and
- (c) the MESI state of each cache block after all addresses in the list have been accessed.

| Time | Memory Access | Hit/Miss Type | Time | Memory Access | Hit/Miss Type |
|------|---------------|---------------|------|---------------|---------------|
| 1    | P1 Reads 5    |               | 11   | P2 Writes 9   |               |
| 2    | P2 Writes 8   |               | 12   | P2 Writes 10  |               |
| 3    | P1 Reads 9    |               | 13   | P2 Reads 2    |               |
| 4    | P1 Writes 14  |               | 14   | P1 Writes 7   |               |
| 5    | P1 Reads 3    |               | 15   | P1 Reads 8    |               |
| 6    | P1 Writes 12  |               | 16   | P1 Reads 4    |               |
| 7    | P2 Reads 6    |               | 17   | P2 Reads 12   |               |
| 8    | P2 Reads 17   |               | 18   | P2 Reads 7    |               |
| 9    | P1 Reads 20   |               | 19   | P1 Writes 2   |               |
| 10   | P2 Reads 4    |               | 20   | P1 Reads 11   |               |

To answer part (a) use the above table. To answer parts (b) and (c), use the below tables.

P1 Cache

| Set | Cache Block Data |  | State |
|-----|------------------|--|-------|
| 0   |                  |  |       |
|     |                  |  |       |
| 1   |                  |  |       |
|     |                  |  |       |
| 2   |                  |  |       |
|     |                  |  |       |
| 3   |                  |  |       |
|     |                  |  |       |

P2 Cache

|  | Set | Cache Block Data |  | State |
|--|-----|------------------|--|-------|
|  | 0   |                  |  |       |
|  |     |                  |  |       |
|  | 1   |                  |  |       |
|  |     |                  |  |       |
|  | 2   |                  |  |       |
|  |     |                  |  |       |
|  | 3   |                  |  |       |
|  |     |                  |  |       |